Skip to content

feat(rivetkit): expose low-cardinality metrics routes#5019

Draft
NathanFlurry wants to merge 1 commit into
counter-latency/serverless-docsfrom
counter-latency/rivetkit-metrics-routes
Draft

feat(rivetkit): expose low-cardinality metrics routes#5019
NathanFlurry wants to merge 1 commit into
counter-latency/serverless-docsfrom
counter-latency/rivetkit-metrics-routes

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 11, 2026

PR #5019 Review: feat(rivetkit): expose low-cardinality metrics routes

Overview

This PR makes two interconnected changes:

  1. Metrics refactoring — switches actor metrics from high-cardinality labels (actor_id_gen, actor_key, envoy_key) to a single low-cardinality actor_name label, removes the RetainedActorMetrics cleanup machinery, renames metric namespaces, and drops several per-actor gauge metrics.
  2. New public route API — replaces the diagnostics() method with three separate HTTP-passthrough routes (health, metadata, metrics) exposed via Registry.routes in TypeScript and via new #[napi] methods in the Rust bridge.

Security Concerns

/metrics endpoint no longer requires auth (breaking security invariant)

In registry/http.rs, the auth check was deleted entirely:

// Before: checked RIVETKIT_METRICS_ENABLED + RIVETKIT_METRICS_TOKEN bearer token
// After:
let metrics = crate::metrics_endpoint::render_prometheus_metrics()?;
bytes_response(StatusCode::OK, &metrics.content_type, metrics.body)

The RIVETKIT_METRICS_ENABLED / RIVETKIT_METRICS_TOKEN env-var gate existed so operators could enable scraping without accidentally exposing metrics to any caller who can reach the actor HTTP port. Removing it makes metrics available to everyone.

Similarly, the new Registry.routes.prometheusMetrics(request?: Request) accepts a request but ignores it — the _request underscore name suggests bearer-token auth was planned but not wired up. There is currently no auth layer on the externally-facing metrics route.


Breaking Changes

Metric renames will break existing dashboards and alerts

  • Registry-level "rivet" prefix removed: old names were rivet_{name}; new names embed rivetkit_ or envoy_ directly, so every metric name changes.
  • Specific renames: pegboard_envoy_*envoy_*, actor_*rivetkit_actor_*, actor_kv_*pegboard_actor_kv_*, SQLite commit envoy timings renamed.
  • The SQLITE_METRICS.md doc is updated, but any Grafana dashboards or Prometheus alert rules using old names will silently stop reporting.

Code Quality

No-op methods are dead code, not backwards-compatibility shims

Several methods are converted to empty no-ops:

pub(crate) fn set_queue_depth(&self, _depth: u32) {}
pub(crate) fn set_active_connections(&self, _count: usize) {}
pub(crate) fn set_lifecycle_inbox_depth(&self, _depth: usize) {}
// etc.

Per CLAUDE.md, backwards-compatibility hacks like this should be avoided. If the metric is intentionally dropped, the callers should be updated to remove the call sites too. Keeping empty stubs is misleading — it looks like metrics are tracked when they aren't.

Default for ActorMetrics uses an empty actor_name

impl Default for ActorMetrics {
    fn default() -> Self {
        Self::new("")  // increments actor_active_count{actor_name=""}
    }
}

This creates a live Prometheus label entry with actor_name="" that increments and decrements on drop. Any code path hitting Default::default() produces a confusing actor_name="" entry in rendered metrics.

serve_config_from_js serverless_cache_envoy parameter is always true

Both call sites pass true. Consider hardcoding serverless_cache_envoy: true inside the function rather than exposing it as a parameter.

CoreEnvoyHandle wrapper is a thin newtype

CoreEnvoyHandle wraps EnvoyHandle with only a status() method returning CoreEnvoyStatus. The extra wrapper type adds indirection without meaningful encapsulation; EnvoyHandle could expose these methods directly.

candidates[0]! in #activeConfiguredRegistry

The old code iterated all candidates; the new code silently takes only the first. A comment explaining the single-candidate assumption would help.


Observability Concerns

Removed per-actor granularity is intentional but irreversible

Replacing instance-level labels with just actor_name means all metrics are aggregated by actor type. This is the stated goal (low-cardinality) but removes the ability to debug a single misbehaving actor instance via Prometheus. Ensure the inspector endpoint still covers per-instance diagnostics and this trade-off is documented.

actor_active_count retention behavior changed

The old actor_active gauge was retained for 10 minutes after actor shutdown. The new actor_active_count decrements immediately on Drop. Prometheus may now see the counter drop to 0 between scrape intervals rather than decaying slowly.

Serverless no-envoy health is optimistic

None => health_response(200, "ok", &version),

When serverless mode has no envoy, health returns 200 ok. The equivalent non-serverless case returns 503 starting. The asymmetry may be intentional but should be documented.


Test Coverage

  • Good additions: actor_startup_duration_metrics_render and actor_active_count_tracks_metric_lifetime.
  • Missing: no tests for the new health(), metadata(), or metrics() NAPI methods.
  • Missing: no test verifying that no-op methods don't accidentally produce metric entries.

Minor

  • version: "wasm" is hardcoded in WasmCoreRuntime.registryHealth() instead of using a real version constant.
  • route_package_version fallback returns env!("CARGO_PKG_VERSION") (Rust crate version) before the registry starts — health requests before startup return the wrong version string.
  • Indentation regression in index.d.ts: */ closing comment dropped one space.

Summary

The low-cardinality direction is the right call for production metrics systems. The main blockers are the removed auth on the runtime HTTP handler and the missing auth on the new public prometheusMetrics route. The no-op dead code and metric rename breaks are worth cleaning up before merging.

Generated by claude[bot]

@MasterPtato MasterPtato force-pushed the counter-latency/serverless-docs branch from cb4ffa1 to e646699 Compare May 14, 2026 17:26
@MasterPtato MasterPtato force-pushed the counter-latency/rivetkit-metrics-routes branch from 3a9c1be to 72886b4 Compare May 14, 2026 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant